Linear Regression + Simulating Distributions

Monday, May 29

Today we will…

  • New Material
    • Review of Simple Linear Regression
    • Assessing Model Fit
  • PA 9.1: Mystery Animal
  • Lab 9: Baby Names

Some Data

NC Births Data

This dataset contains a random sample of 1,000 births from North Carolina in 2004 (sampled from a larger dataset).

  • Each case describes the birth of a single child, including characteristics of the:
    • child (birth weight, length of gestation, etc.).
    • birth parents (age, weight gained during pregnancy, smoking habits, etc.).

NC Births Data

library(openintro)
data(ncbirths)
slice_sample(ncbirths, n = 10) |> 
  knitr::kable() |> 
  kableExtra::scroll_box(height = "400px") |> 
  kableExtra::kable_styling(font_size = 30)
fage  mage  mature       weeks  premie     visits  marital      gained  weight  lowbirthweight  gender  habit      whitemom
NA    19    younger mom  42     full term  10      not married  25      9.50    not low         female  nonsmoker  not white
21    18    younger mom  40     full term  13      not married  10      5.50    low             female  nonsmoker  white
34    29    younger mom  38     full term  16      married      35      6.19    not low         female  nonsmoker  white
NA    25    younger mom  39     full term  14      married      20      6.88    not low         male    nonsmoker  white
36    27    younger mom  40     full term  11      not married  18      7.75    not low         male    nonsmoker  not white
NA    24    younger mom  37     full term  5       married      34      7.38    not low         male    smoker     white
41    39    mature mom   39     full term  11      married      30      8.06    not low         female  nonsmoker  white
26    21    younger mom  44     full term  11      married      15      7.50    not low         female  nonsmoker  white
32    21    younger mom  40     full term  11      married      58      7.56    not low         female  nonsmoker  white
42    38    mature mom   39     full term  15      married      50      8.94    not low         female  nonsmoker  not white

Relationships Between Variables

Relationships Between Variables

In statistical models, we generally have one variable that is the response and one or more variables that are explanatory.

  • Response variable
    • Also: \(y\), dependent variable
    • This is the quantity we want to understand.
  • Explanatory variable
    • Also: \(x\), independent variable, predictor
    • This is something we think might be related to the response.

Visualizing a Relationship

The scatterplot has been called the most “generally useful invention in the history of statistical graphics.”
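A minimal scatterplot sketch, using ggplot2's built-in mpg dataset (the variable choices here are just for illustration):

```r
# A scatterplot of two quantitative variables, using ggplot2's
# built-in mpg data purely for illustration.
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (L)", y = "Highway MPG")
p
```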

Characterizing Relationships

  • Form: linear, quadratic, non-linear?
  • Direction: positive, negative?
  • Strength: how much scatter/noise?
  • Unusual observations: do points not fit the overall pattern?

Your turn!

How would you characterize this relationship?

  • Shape?
  • Direction?
  • Strength?
  • Outliers?

Note: Much of what we are doing at this stage involves making judgment calls!

Fitting a Model

We often assume the value of our response variable is some function of our explanatory variable, plus some random noise.

\[response = f(explanatory) + noise\]

  • There is a mathematical function \(f\) that can translate values of one variable into values of another.
  • But there is some randomness in the process.

Simple Linear Regression (SLR)

If we assume the relationship between \(x\) and \(y\) takes the form of a linear function

\[ response = intercept + slope \times explanatory + noise \]

We use the following notation for this model:

Population Regression Model

\(Y = \beta_0 + \beta_1 X + \varepsilon\)

where \(\varepsilon \sim N(0, \sigma)\) is the random noise.

Fitted Regression Model

\(\hat{y}= \hat{\beta}_0 + \hat{\beta}_1 x\)

where   \(\hat{}\)   indicates the value was estimated.

Fitting an SLR Model

Regress baby birthweight (response variable) on the pregnant parent’s weight gain (explanatory variable).

  • We are assuming there is a linear relationship between how much weight the parent gains and how much the baby weighs at birth.

When visualizing data, fit a regression line (\(y\) on \(x\)) to your scatterplot.

ncbirths |> 
  ggplot(aes(x = gained, y = weight)) +
  geom_jitter() + 
  geom_smooth(method = "lm") 

The lm() function fits a linear regression model.

  • We use formula notation to specify the response variable (LHS) and the explanatory variable (RHS).
  • This code creates an lm object.
ncbirth_lm <- lm(weight ~ gained, 
                 data = ncbirths)
broom::tidy(ncbirth_lm)
# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   6.63     0.111       59.6  0         
2 gained        0.0161   0.00332      4.86 0.00000135
  • Intercept: expected mean of \(y\), when \(x\) is 0.
    • A parent who gains 0 lbs will have a baby weighing 6.63 lbs, on average.
  • Slope: expected change in the mean of \(y\), when \(x\) increases by 1 unit.
    • For each additional pound gained, the expected birth weight increases by 0.016 lbs.
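Using the estimates printed by tidy(), we can compute a predicted mean birth weight by hand; the numbers below simply reuse the coefficients shown above.

```r
# Predicted mean birth weight for a parent who gained 30 lbs,
# plugging into y-hat = b0 + b1 * x with the estimates from tidy().
b0 <- 6.63003   # intercept
b1 <- 0.01614   # slope for gained
y_hat <- b0 + b1 * 30
y_hat   # about 7.11 lbs
```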

Residuals: the difference between observed (point) and fitted (line) values.

ncbirth_lm$residuals |> 
  head(3)
         1          2          3 
 0.3866279  0.9271567 -0.6133721 
resid(ncbirth_lm) |> 
  head(3)
         1          2          3 
 0.3866279  0.9271567 -0.6133721 
broom::augment(ncbirth_lm) |> 
  head(3)
# A tibble: 3 × 9
  .rownames weight gained .fitted .resid    .hat .sigma   .cooksd .std.resid
  <chr>      <dbl>  <int>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
1 1           7.63     38    7.24  0.387 0.00133   1.47 0.0000458      0.262
2 2           7.88     20    6.95  0.927 0.00157   1.47 0.000311       0.630
3 3           6.63     38    7.24 -0.613 0.00133   1.47 0.000115      -0.416
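All three approaches return observed minus fitted values; a quick check on simulated data (not the ncbirths model) confirms the identity:

```r
# Residual = observed y minus fitted y-hat; verified on simulated data.
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x + rnorm(20)

fit <- lm(y ~ x, data = d)
all.equal(unname(resid(fit)), d$y - unname(fitted(fit)))  # TRUE
```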

Model Diagnostics

There are four conditions that must be met for a linear regression model to be appropriate:

  1. A linear relationship.
  2. Independent observations.
  3. Normally distributed residuals.
  4. Residuals have constant variance.

Model Diagnostics

Is the relationship linear?

  • Almost nothing will look perfectly linear.
  • Be careful with relationships that have curvature.
  • Try transforming your variables!

Are the observations independent?   Difficult to tell!

What does independence mean?

You should not be able to predict the \(y\) value for one observation from the \(y\) value of another observation.

Independence violations:

  • Stock market prices over time.
  • Geographical similarities.
  • Biological conditions of family members.
  • Repeated observations.

Are the residuals normally distributed?

Observations should be symmetric around the regression line.

Less important than linearity or independence:

  • The slope estimates remain unbiased even when the residuals are not normal.
  • With large sample sizes, inference is robust to non-normality.
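One common way to check this condition is a QQ plot of the residuals; a sketch on simulated data (not the ncbirths model):

```r
# Checking residual normality with a QQ plot (simulated example).
set.seed(2)
d <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + rnorm(100, sd = 0.5)

fit <- lm(y ~ x, data = d)
qqnorm(resid(fit))   # points should fall near a straight line
qqline(resid(fit))
```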

Do the residuals have equal (constant) variance?

  • The variability of points around the regression line is roughly constant.
  • Data with non-equal variance across the range of \(x\) can seriously mis-estimate the variability of the slope.
ncbirth_lm |> 
  broom::augment() |> 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point()

Assessing Model Fit

Sum of Squared Errors (SSE)

  • This is calculated as the sum of the squared residuals.
  • A small SSE means small differences between observed and fitted values.
  • Also: Sum Sq of Residuals or deviance.
anova(ncbirth_lm)
Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)    
gained      1   51.36  51.357  23.642 1.353e-06 ***
Residuals 971 2109.32   2.172                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
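The SSE in the ANOVA table (the Residuals Sum Sq) can be computed directly from any lm object, as this simulated sketch shows:

```r
# SSE = sum of squared residuals; deviance() returns the same number.
set.seed(3)
d <- data.frame(x = 1:50)
d$y <- 3 + 0.2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
sse <- sum(resid(fit)^2)
all.equal(sse, deviance(fit))  # TRUE
```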

Assessing Model Fit

Root Mean Square Error (RMSE)

  • The standard deviation of the residuals.
  • A small RMSE means small differences between observed and fitted values.
  • Also: Residual standard error or sigma.
summary(ncbirth_lm)

Call:
lm(formula = weight ~ gained, data = ncbirths)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4085 -0.6950  0.1643  0.9222  4.5158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.63003    0.11120  59.620  < 2e-16 ***
gained       0.01614    0.00332   4.862 1.35e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.474 on 971 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.02377,   Adjusted R-squared:  0.02276 
F-statistic: 23.64 on 1 and 971 DF,  p-value: 1.353e-06
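The "Residual standard error" reported by summary() can be extracted with sigma(), and equals the square root of SSE divided by the residual degrees of freedom; a simulated sketch:

```r
# RMSE (residual standard error) = sqrt(SSE / df.residual).
set.seed(4)
d <- data.frame(x = 1:50)
d$y <- 3 + 0.2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
rmse <- sqrt(sum(resid(fit)^2) / df.residual(fit))
all.equal(rmse, sigma(fit))  # TRUE
```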

Assessing Model Fit

R-squared

  • The proportion of variability in the response accounted for by the linear model.
  • A large R-squared means the explanatory variable is good at explaining the response.
  • R-squared is between 0 and 1.
broom::glance(ncbirth_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0238        0.0228  1.47      23.6 0.00000135     1 -1757. 3520. 3535.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
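R-squared can also be computed by hand as 1 minus SSE over SST (the total sum of squares); a simulated check:

```r
# R^2 = 1 - SSE / SST, matching summary()'s Multiple R-squared.
set.seed(5)
d <- data.frame(x = 1:50)
d$y <- 3 + 0.2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
sse <- sum(resid(fit)^2)
sst <- sum((d$y - mean(d$y))^2)
all.equal(1 - sse / sst, summary(fit)$r.squared)  # TRUE
```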

Model Comparison

Regress baby birthweight on…

… gestation weeks.

weight_weeks <- lm(weight ~ weeks, 
                   data = ncbirths)
  • SSE = 1246.55
  • RMSE = 1.119
  • \(R^2\) = 0.449

… number of doctor visits.

weight_visits <- lm(weight ~ visits, 
                    data = ncbirths)
  • SSE = 2152.74
  • RMSE = 1.475
  • \(R^2\) = 0.01819

Why does it make sense that the first model (using gestation weeks) is better?
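The same kind of comparison can be run on any dataset; here is a sketch with the built-in mtcars data, comparing two single-predictor models by R-squared (variable choices are illustrative):

```r
# Comparing two SLR models for mpg by R-squared (mtcars, for illustration).
m_wt   <- lm(mpg ~ wt,   data = mtcars)
m_qsec <- lm(mpg ~ qsec, data = mtcars)

summary(m_wt)$r.squared    # about 0.75
summary(m_qsec)$r.squared  # about 0.18
```

The model with the larger R-squared (and smaller SSE/RMSE) explains more of the variability in the response.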

Multiple Linear Regression

When fitting a linear regression, you can include…

…multiple explanatory variables.

lm(y ~ x1 + x2 + x3 + ... + xn)

…interaction terms.

lm(y ~ x1 + x2 + x1:x2)

lm(y ~ x1*x2)

…quadratic relationships.

lm(y ~ I(x1^2) + x1)
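A runnable sketch of each formula style, fit to the built-in mtcars data (the variables are illustrative stand-ins for x1 and x2):

```r
# Multiple explanatory variables.
m_add  <- lm(mpg ~ wt + hp, data = mtcars)

# Interaction term (wt * hp expands to wt + hp + wt:hp).
m_int  <- lm(mpg ~ wt * hp, data = mtcars)

# Quadratic relationship; I() protects the ^2 from formula expansion.
m_quad <- lm(mpg ~ I(wt^2) + wt, data = mtcars)

length(coef(m_add))   # 3 coefficients: intercept, wt, hp
length(coef(m_int))   # 4 coefficients: adds wt:hp
```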

Work Time:

To do…

  • PA 9.1: Mystery Animal
    • Due Friday, 6/2 at 11:59pm
  • Lab 9: Baby Names
    • Due Friday, 6/2 at 11:59pm
  • Challenge 9: Formatting Nice Tables
    • Due Saturday, 6/3 at 11:59pm
  • Project Check-point 2: Linear Regression
    • Due Monday, 6/5 at 11:59pm

Wednesday, May 31

Today we will…

  • New Material
    • Simulating Distributions
  • Work Time
    • PA 9.2: Instrument Con
    • Lab & Challenge 9
    • Project: Linear Regression

Statistical Distributions

Recall from your statistics classes…

A random variable is a value we don’t know until we take a sample.

  • Coin flip: could be heads (0) or tails (1)
  • Person’s height: could be anything from 0 feet to 10 feet.
  • Annual income of a US worker: could be anything from $0 to $1.6 billion

The distribution of a random variable tells us its possible values and how likely they are.

  • Coin flip: 50% chance each of heads and tails.
  • Heights follow a bell curve centered at 5 foot 7.
  • Most American workers make under $100,000.

Named Distributions

Uniform Distribution

  • When you know the range of values, but not much else.
  • All values in the range are equally likely to occur.

Normal Distribution

  • When you expect values to fall near the center.
  • Frequency of values follows a bell shaped curve.

t-Distribution

  • A slightly wider bell curve.
  • Used in the same contexts as the normal distribution, but more appropriate for real data, where the population standard deviation must be estimated.

Chi-Square Distribution

  • Somewhat skewed, and only allows values above zero.
  • Used in testing count data.

Binomial Distribution

  • Appears when you have two possible outcomes, and you are counting how many times each outcome occurred.

How do we use distributions?

  • Find the probability of an event.
    • If I flip 10 coins, what are the chances I get all heads?
  • Estimate a proportion of a population.
    • About what proportion of people are above 6 feet tall?
  • Quantify the evidence in your data.
    • In my survey of 100 people, 67 said they were voting for Measure A. How confident am I that Measure A will pass?
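The first question above can be answered directly with dbinom():

```r
# P(all 10 heads in 10 fair coin flips) = 0.5^10.
p_all_heads <- dbinom(x = 10, size = 10, prob = 0.5)
p_all_heads                     # about 0.00098
all.equal(p_all_heads, 0.5^10)  # TRUE
```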

Using Distributions in R

r is for random sampling.

  • Generate random values from a distribution.
  • We use this to simulate data (create pretend observations).
runif(n = 3, min = 10, max = 20)
[1] 15.12827 13.04900 12.25474
rnorm(n = 3)
[1]  0.06697959 -0.18697535 -0.77459597
rnorm(n = 3, mean = -100, sd = 50)
[1] -129.17892 -224.33526  -79.85053
rt(n = 3, df = 11)
[1] 1.49673600 0.01106858 0.33403851
rbinom(n = 3, size = 10, prob = 0.7)
[1] 5 7 6
rchisq(n = 3, df = 11)
[1]  8.693145  9.336317 11.465547

p is for probability.

  • Compute the chances of observing a value less than x.
  • We use this for calculating p-values.
pnorm(q = 1.5)
[1] 0.9331928
pnorm(q = 70, mean = 67, sd = 3)
[1] 0.8413447
1 - pnorm(q = 70, mean = 67, sd = 3)
[1] 0.1586553
pnorm(q = 70, mean = 67, sd = 3, lower.tail = FALSE)
[1] 0.1586553

q is for quantile.

  • Given a probability \(p\), compute \(x\) such that \(P(X < x) = p\).
  • The q functions are “backwards” of the p functions.
qnorm(p = 0.95)
[1] 1.644854
qnorm(p = 0.95, mean = 67, sd = 3)
[1] 71.93456

d is for density.

  • Compute the height of a distribution curve at a given \(x\).
  • For a discrete distribution: the probability of getting exactly \(x\).
  • For a continuous distribution: the curve height, which is not itself a probability.

Probability of exactly 12 heads in 20 coin tosses, if each toss has a 30% chance of heads (70% chance of tails)?

dbinom(x = 12, size = 20, prob = 0.3)
[1] 0.003859282

Simulating Data

We can generate fake data based on the assumption that a variable follows a certain distribution.

  • We randomly sample observations from the distribution.
age <- runif(1000, min = 15, max = 75)

Since there is randomness involved, we will get a different result each time we run the code.

runif(3, min = 15, max = 75)
[1] 28.84192 42.09009 40.36888
runif(3, min = 15, max = 75)
[1] 47.20604 22.86354 27.27028

To make a reproducible random sample, we first set the seed:

set.seed(94301)
runif(3, min = 15, max = 75)
[1] 59.68333 30.27768 38.29962
set.seed(94301)
runif(3, min = 15, max = 75)
[1] 59.68333 30.27768 38.29962
set.seed(435)
fake_data <- tibble(names   = charlatan::ch_name(1000),
        height  = rnorm(1000, mean = 67, sd = 3),
        age     = runif(1000, min = 15, max = 75),
        measure = rbinom(1000, size = 1, prob = 0.6)) |> 
  mutate(supports_measure_A = ifelse(measure == 1, "yes", "no"))
head(fake_data)
# A tibble: 6 × 5
  names                 height   age measure supports_measure_A
  <chr>                  <dbl> <dbl>   <int> <chr>             
1 Elbridge Kautzer        67.4  66.3       1 yes               
2 Brandon King            65.0  61.5       0 no                
3 Phyllis Thompson        68.1  53.8       1 yes               
4 Humberto Corwin         67.5  33.9       1 yes               
5 Theresia Koelpin        71.4  16.1       1 yes               
6 Hayden O'Reilly-Johns   66.2  37.0       0 no                

Check that the ages look uniformly distributed.

Code
fake_data |> 
  ggplot(aes(y = supports_measure_A, 
             x = age,
             fill = supports_measure_A)) +
  ggridges::geom_density_ridges(show.legend = F) +
  scale_fill_brewer(palette = "Paired") +
  theme_bw() +
  labs(x = "Age (years)",
       y = "",
       subtitle = "Support for Measure A")

Plotting Distributions

fake_data |> 
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, color = "white")

  1. Plot your data.
fake_data |> 
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, color = "white") +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3),
                color = "steelblue", lwd = 2)

  2. Add a density curve.
fake_data |> 
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 10, color = "white") +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3),
                color = "steelblue", lwd = 2)

  3. Change the y-axis of the histogram to match the y-axis of the density.

Empirical vs Theoretical Distributions

Empirical: the observed data.

Code
fake_data %>%
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10, color = "white") 

Theoretical: the distribution curve.

Code
fake_data %>%
  ggplot(aes(x = height)) +
  stat_function(fun = ~dnorm(.x, mean = 67, sd = 3),
                col = "steelblue", lwd = 2) +
  stat_function(fun = ~dnorm(.x, mean = 67, sd = 2),
                col = "orange2", lwd = 2)

In-line Code

We can automatically include code output in the written portion of a Quarto document using `r `.

  • This ensures reproducibility when you have results from a random generation process.
my_rand <- rnorm(1, mean = 0, sd = 1)
my_rand
[1] -1.165964

Type this: My random number is `r my_rand`.

To get this: My random number is -1.1659638.

To do…

  • PA 9.1: Mystery Animal – due Friday, 6/2 at 11:59pm.
  • PA 9.2: Instrument Con – due Friday, 6/2 at 11:59pm.
  • Lab 9: Baby Names – due Friday, 6/2 at 11:59pm.
  • Challenge 9: Formatting Nice Tables – due Saturday, 6/3 at 11:59pm.
  • Project: Linear Regression – due Monday, 6/5 at 11:59pm.
  • Read Chapter 10 and complete Check-in 10.1 – due Monday, 6/5 at 10:00am.